Abstract

There are non-Gaussian time series that admit a causal linear au- 
toregressive moving average (ARMA) model when regressing the fu- 
ture on the past, but not when regressing the past on the future. The 
reason is that, in the latter case, the regression residuals are only un- 
correlated but not statistically independent of the future. In previous 
work, we have experimentally verified that many empirical time series 
indeed show such a time inversion asymmetry. 

For various physical systems, it is known that time-inversion asym- 
metries are linked to the thermodynamic entropy production in non- 
equilibrium states. Here we show that such a link also exists for the 
above unidirectional linearity. 

We study the dynamical evolution of a physical toy system with
linear coupling to an infinite environment and show that the linearity
of the dynamics is inherited by the forward-time conditional
probabilities, but not by the backward-time conditionals. The reason for
this asymmetry between past and future is that the environment
permanently provides particles that are in a product state before they
interact with the system, but show statistical dependencies afterwards.
From a coarse-grained perspective, the interaction thus generates
entropy. We quantitatively relate the strength of the non-linearity of the
backward conditionals to the minimal amount of entropy generation.

1 Unidirectional linearity in time series 

The implications and the different versions of the thermodynamic
arrow of time have attracted the interest of theoretical physicists and
philosophers for a long time [1, 2, 3, 4, 5, 6, 7]. More specifically, the
question is how the time reversibility of microscopic physical dynamics
is consistent with the existence of irreversible processes on the macroscopic
level. The most prominent examples of irreversibility (e.g. heat always flows
from the hot to the cold reservoir, never vice versa; every kind of energy can
be converted into heat, but not vice versa) can directly be explained by the
fact that the processes generate entropy and their inverted counterparts are
therefore forbidden by the second law.

Here we describe an asymmetry between past and future whose connection
to the second law is more subtle. An extensive analysis of more than
1000 time series [8] showed that there are many cases where the statistics
could be better explained by a linear autoregressive model from the past to
the future and only few cases where regressing the past on the future yields
a better model [8, 9]. In the context of non-equilibrium thermodynamics it
has been shown for various physical models (e.g. [TT], and also in a more
abstract setting [12]) that statistical asymmetries between past and future
can be related to thermodynamic entropy production.

This paper is in the same spirit, but we will try to use only those as- 
sumptions about the underlying physical system that are necessary to make 
the case and try to simplify the argument as much as possible. The ingredi- 
ents are (1) a system interacting with an environment consisting of infinitely 
many subsystems that are initially in a product state, each system having an 
abstract vector space as phase space, (2) linear volume preserving dynamical 
equations for the joint system. We will not refer to any other ingredients 
from physics, like energy levels, thermal Gibbs states, etc. Of course, this 
raises the question of how to define entropy production. Here, we interpret
the generation of dependencies among an increasing number of particles as
entropy production.

To describe the model more precisely, we start with preliminary remarks 
on statistical dependencies. First we introduce the following terminology.

Definition 1 (linear models)

The joint distribution $P_{X,Y}$ of two real-valued random variables $X$ and
$Y$ is said to admit a linear model $X \to Y$ with additive noise (linear
model, for short) if $Y$ can be written as

$$Y = aX + \epsilon$$

with a structure coefficient $a \in \mathbb{R}$ and a noise term $\epsilon$
that is statistically independent of $X$ ($X \perp\!\!\!\perp \epsilon$, for
short).

It should be emphasized that statistical independence between two random
variables $Z, W$ is defined by factorizing probabilities

$$P_{Z,W} = P_Z \otimes P_W ,$$

instead of the weaker condition of uncorrelatedness, which is defined by
factorizing expectations:

$$\mathbb{E}(ZW) = \mathbb{E}(Z)\,\mathbb{E}(W) . \tag{1}$$

Uncorrelatedness between $X$ and $\epsilon$ is automatically satisfied if $a$
is chosen to minimize the least square error.
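
The difference between uncorrelated and independent residuals can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the uniform distributions, the coefficient a = 1, the sample size, and the helper names `pearson`, `least_squares_residuals` and `square_dependence` are all illustrative choices, and the correlation between squared values is a crude dependence score, not the independence test used in the experiments.

```python
import random

def pearson(u, v):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def least_squares_residuals(x, y):
    """Residuals of the least-squares fit y ~ a * x (no intercept)."""
    a = sum(p * q for p, q in zip(x, y)) / sum(p * p for p in x)
    return [q - a * p for p, q in zip(x, y)]

def square_dependence(res, regressor):
    """Crude dependence score: correlation of squared residual and squared regressor."""
    return pearson([r * r for r in res], [s * s for s in regressor])

rng = random.Random(0)
n = 20000
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # non-Gaussian cause (assumed uniform)
e = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # non-Gaussian noise, independent of x
y = [xi + ei for xi, ei in zip(x, e)]            # linear model X -> Y with a = 1

res_fwd = least_squares_residuals(x, y)          # regress Y on X
res_bwd = least_squares_residuals(y, x)          # regress X on Y

corr_fwd = pearson(res_fwd, x)                   # close to zero
corr_bwd = pearson(res_bwd, y)                   # close to zero as well
dep_fwd = square_dependence(res_fwd, x)          # close to zero: residual independent of X
dep_bwd = square_dependence(res_bwd, y)          # clearly non-zero: residual depends on Y
```

In both directions the residuals are uncorrelated with the regressor, but only in the forward direction does the dependence score stay near zero.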

Except for the trivial case of independence, $P_{X,Y}$ can only admit linear
models in both directions at the same time if it is bivariate Gaussian. This
can be shown using the theorem of Darmois and Skitovich [13], which we
rephrase now because it will also be used later.

Lemma 1 (Theorem of Darmois & Skitovich)

Let $Y_1, Y_2, \dots, Y_k$ be statistically independent random variables and
let the two linear combinations

$$W_1 = \sum_{i=1}^{k} \beta_i Y_i , \qquad W_2 = \sum_{i=1}^{k} \gamma_i Y_i$$

be independent. Then all $Y_j$ with $\beta_j \gamma_j \neq 0$ are Gaussian.
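
The step from the lemma to the claim about bidirectional linear models can be spelled out as follows. This is a sketch of the standard argument, not a quotation from [13]; the symbols $b$ and $\epsilon'$ for the backward coefficient and the backward noise are introduced here only for illustration.

```latex
\begin{align*}
&\text{Forward model:}  && Y = aX + \epsilon, \qquad X \perp\!\!\!\perp \epsilon,\\
&\text{Backward model:} && X = bY + \epsilon', \qquad Y \perp\!\!\!\perp \epsilon',\\
&\text{hence}           && \epsilon' = X - bY = (1 - ab)\,X - b\,\epsilon .
\end{align*}
% Y = a X + 1 * epsilon and epsilon' = (1 - ab) X - b * epsilon are two
% independent linear combinations of the independent variables X and epsilon.
% By the lemma, X is Gaussian if a (1 - ab) \neq 0, and epsilon is Gaussian
% if b \neq 0. In the non-trivial case (a \neq 0, b \neq 0, ab \neq 1) both
% are Gaussian, so (X, Y) is bivariate Gaussian.
```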
In the context of causal inference from statistical data, it has been
proposed to consider the direction of the linear model as the causal direction
[14, 15]. In [8] we have shown that the same idea can be used to solve the
following binary classification problem: Given numbers $X_1, X_2, X_3, \dots$
that are known to be the values of an empirical time series in their correct
or in their time-reversed order, decide whether $X_1, X_2, X_3, \dots$ or
$\dots, X_3, X_2, X_1$ is the correct order. Certainly, this problem is less
relevant than the problem of inferring causality, since our experiment
required artificially blurring the true direction even though it was actually
known. The motivation for our study was to test causal inference principles
by applying them to this artificial problem.

To explain our "time direction inference rule" we first introduce an
important class of stochastic processes [16]:

Definition 2 (ARMA models)

We call a time series $(X_t)_{t \in \mathbb{Z}}$ an autoregressive moving
average process of order $(p, q)$ if it is weakly stationary and there is an
i.i.d. noise $\epsilon_t$ with mean zero such that

$$X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t .$$

For $q = 0$ the process reduces to an autoregressive process and for $p = 0$
to a moving average process. The short-hand notations are ARMA$(p, q)$,
AR$(p)$, and MA$(q)$. The first and the second sums are called the AR-part
and the MA-part, respectively.

The process is called causal if

$$\epsilon_t \perp\!\!\!\perp X_{t-i} \quad \forall i > 0 . \tag{2}$$

Note that a process is called weakly stationary if the mean
$\mathbb{E}(X_t)$ and the second-order moments $\mathbb{E}(X_t X_{t+h})$ are
constant in time [16]. In [8] we have shown the following theorem:

Theorem 1 (non-invertibility of non-Gaussian processes)

If $(X_t)_{t \in \mathbb{Z}}$ is a causal ARMA process with non-vanishing
AR-part, then $(X_{-t})_{t \in \mathbb{Z}}$ is a causal ARMA process if and
only if $(X_t)$ is a Gaussian process.

^1 [16] chooses a different definition, but we have argued in [8] that it is
equivalent to ours.
In particular, a process with long-tailed distributions, e.g. Cauchy, can
only be causal in one direction (provided that it has an AR-part). In [8] we
have postulated that whenever a time series has a causal ARMA model in
one direction but not the other, the former is likely to be the true one.
Some remarks on the practical implementation need to be made: Testing
condition (2) yields p-values for the hypothesis of independence. The
performance of our inference method depends heavily on how these p-values are
used to decide whether a linear model is accepted for one and only one of
the directions. Our rule depends on two parameters $\alpha$ and $\delta$, the
significance level and the gap, respectively. We say that an ARMA model is
accepted for one direction but not the other if the p-value for the direction
under consideration is above $\alpha$, it is below $\alpha$ for the converse
direction and, moreover, the gap is at least $\delta$. By choosing a small
value $\alpha$ and a large value $\delta$ one gets fewer decisions, but the
fraction of wrong classifications also decreases. On 1180 empirical time
series from EEGs [8] we were able to classify around 82% correctly when the
parameters were set such that the rule yields decisions for about 4% of the
time series. When decisions were made for a larger fraction of time series,
the number of correct answers still significantly exceeded chance level.
Qualitatively similar results were obtained for 200 time series from
different areas, like finance, physics, transportation, crime, production
of goods, demography, economy, neuroscience, and agriculture [9].
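
The acceptance rule described above can be sketched in a few lines. This is a hedged illustration, not the implementation used in [8]: the function name `infer_direction` is hypothetical, the p-values are assumed to come from some independence test applied to the residuals in each direction, and the "gap" is interpreted here as the difference between the two p-values, which is an assumption.

```python
from typing import Optional

def infer_direction(p_forward: float, p_backward: float,
                    alpha: float, delta: float) -> Optional[str]:
    """Return 'forward', 'backward', or None (no decision).

    p_forward / p_backward: p-values for independence of the residuals when
    fitting the ARMA model in the given / the reversed time order.
    alpha: significance level; delta: required gap between the p-values
    (interpreted as their difference, which is an assumption of this sketch).
    """
    if p_forward > alpha and p_backward < alpha and p_forward - p_backward >= delta:
        return "forward"
    if p_backward > alpha and p_forward < alpha and p_backward - p_forward >= delta:
        return "backward"
    return None  # both accepted, both rejected, or gap too small

# Illustrative values only (not taken from the experiments):
decision_a = infer_direction(0.40, 0.001, alpha=0.05, delta=0.1)
decision_b = infer_direction(0.06, 0.04, alpha=0.05, delta=0.1)
```

A small `alpha` together with a large `delta` makes the function return `None` more often, which mirrors the trade-off between the number of decisions and the fraction of wrong classifications.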



2 Physical toy model 

Here we describe a physical model that suggests that the observed asymmetry
is an implication of generally accepted asymmetries between past and future.
We assume that the values $X_t$ are observables of a classical physical
system.^2 For our toy model, we use only two properties of physical models
that we consider decisive for the argument:

(1) The state of a system is a point in some phase space $S$ that is a
sub-manifold of $\mathbb{R}^n$.

(2) The dynamical evolution of an isolated system is given by a family $M_t$
of volume-preserving bijections on $S$.

Due to Liouville's Theorem, this holds for the dynamics of all Hamiltonian
systems; other dynamical maps can only be obtained by restricting the joint
evolution of a composed system to one of its components.

^2 Of course, such an embedding is hard to imagine for time series from stock
markets, for instance. However, other time series, e.g., EEG data, are more
closely related to physical observables.

For simplicity, we restrict our attention to an AR(1) process:

$$X_t = \phi X_{t-1} + \epsilon_t . \tag{3}$$

We will now interpret $X_t$ as a physical observable of a system $S^{(0)}$
whose state is changed by interacting with its environment. The latter
consists of an infinite collection of subsystems $S^{(j)}$ with
$j \in \mathbb{Z} \setminus \{0\}$. Each subsystem is described by the
real-valued observable $Z^{(j)}$. Its value at time $t$ is denoted by
$Z_t^{(j)}$, hence $X_t = Z_t^{(0)}$, but we will keep the notation $X_t$
whenever its special status among the variables should be emphasized.
Then we define a joint time evolution by

$$Z_{t+1}^{(0)} = \gamma_{11} Z_t^{(0)} + \gamma_{12} Z_t^{(-1)} \tag{4}$$

$$Z_{t+1}^{(1)} = \gamma_{21} Z_t^{(0)} + \gamma_{22} Z_t^{(-1)} \tag{5}$$

$$Z_{t+1}^{(j)} = Z_t^{(j-1)} \quad \text{for } j \neq 0, 1 . \tag{6}$$

The dynamics thus is a concatenation of the map $\Gamma$ on $\mathbb{R}^2$,
given by the entries $\gamma_{ij}$, with a shift propagating the state of
subsystem $S^{(j)}$ to $S^{(j+1)}$.

The environment may be thought of as a beam of particles that approaches
site $S^{(0)}$, interacts with it, and disappears to infinity; we have
discretized the propagation only to make it compatible with the discrete
stochastic process. The interaction is given by $\Gamma$. The phase space of
the systems $S^{(j)}$ may be larger than one-dimensional, but we assume that
the variables $Z_t^{(j)}$ define the observables that are relevant for the
interaction. To ensure conservation of volume in the entire phase space,
$\Gamma$ needs to be volume-preserving, i.e. $|\det(\Gamma)| = 1$. Since our
model should be interpreted as the discretization of a continuous time
process, we assume $\Gamma \in SL(2)$.

One checks easily that the above dynamical system generates for $t > 0$
the causal AR(1)-process

$$X_t = \gamma_{11} X_{t-1} + \epsilon_t \quad \text{with} \quad \epsilon_t := \gamma_{12} Z_{t-1}^{(-1)} ,$$

if we impose the initial conditions

$$Z_0^{(j)} \ \text{i.i.d. with some distribution } Q . \tag{7}$$

Actually, it would be sufficient to impose independence only for the
non-positive $j$, but later it will be convenient to include also positive
values $j$ and assume that the whole ARMA process has a starting time
$t = 0$. This will make it easier to track the increase of dependencies over
time. The fact that every $Z_0^{(j)}$ is drawn from the same distribution $Q$
ensures that the process $(X_t)_{t \in \mathbb{N}}$ is stationary.
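
The claim that the toy dynamics reproduces an AR(1) recursion can be checked numerically. The following is a minimal sketch under several assumptions that are not part of the text: the infinite chain is truncated to a finite window of sites, the interaction is taken to be a rotation (one convenient element of SL(2)), Q is taken to be uniform on [-1, 1], and the names `step`, `draw`, `z0` and `xs` are hypothetical.

```python
import math
import random

def step(z, gamma, lo, hi, draw):
    """One time step of the toy dynamics on the finite window of sites lo..hi."""
    (g11, g12), (g21, g22) = gamma
    new = {}
    for j in range(lo, hi + 1):
        if j == 0:
            new[j] = g11 * z[0] + g12 * z[-1]   # system after the interaction
        elif j == 1:
            new[j] = g21 * z[0] + g22 * z[-1]   # outgoing particle
        elif j == lo:
            new[j] = draw()                     # stands in for site lo - 1 of the infinite chain
        else:
            new[j] = z[j - 1]                   # pure shift j - 1 -> j
    return new

rng = random.Random(1)
draw = lambda: rng.uniform(-1.0, 1.0)           # assumed distribution Q (non-Gaussian)
angle = 0.7                                     # arbitrary illustrative angle
gamma = ((math.cos(angle), math.sin(angle)),
         (-math.sin(angle), math.cos(angle)))   # rotation, determinant 1

T = 50
lo, hi = -T, T
z0 = {j: draw() for j in range(lo, hi + 1)}     # i.i.d. initial condition

z = dict(z0)
xs = [z[0]]                                     # X_0
for _ in range(T):
    z = step(z, gamma, lo, hi, draw)
    xs.append(z[0])                             # X_t = Z_t^(0)

# The particle at site -1 at time t - 1 started at site -t at time 0, so the
# recursion reads X_t = gamma_11 * X_{t-1} + gamma_12 * z0[-t] for t = 1..T.
max_err = max(abs(xs[t] - (gamma[0][0] * xs[t - 1] + gamma[0][1] * z0[-t]))
              for t in range(1, T + 1))
```

The noise term entering at step t is an initial value that has not interacted with anything before, which is why it is independent of all earlier values of the process.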

We will now show that, under generic conditions, the dynamics creates
statistical dependencies between the subsystems. We will later see that this
is the reason why the time-inverted version of the above scenario would not
be a reasonable physical model for the process $(X_{-t})$. We need the
following lemma:

Lemma 2 (dependencies from sequences of adjacent operations)

Let $\Gamma \in SL(2)$ have both diagonal and off-diagonal non-zero entries.
Denote by $\Gamma_{l,l+1}$ the embedding into the two-dimensional subspaces
of $\mathbb{R}^n$ that correspond to consecutive components $l, l+1$ with
$l = 0, \dots, n-2$, i.e.,

$$\Gamma_{l,l+1} := I_l \oplus \Gamma \oplus I_{n-l-2} ,$$

where $I_m$ denotes the identity matrix in $m$ dimensions. Let $P$ be a
non-Gaussian distribution on $\mathbb{R}$. Then the application of

$$\Gamma_{0,1} \circ \Gamma_{1,2} \circ \cdots \circ \Gamma_{n-2,n-1}$$

to $\mathbb{R}^n$ turns the product distribution $P^{\otimes n}$ into a
non-product distribution.

Proof: Due to Lemma 1, $\Gamma_{n-2,n-1}$ generates dependencies between the
last and the second last component. Since none of the other operations acts
on the last component, the dependence between the last component and the
joint system given by the remaining $n-1$ components is preserved. □

To apply Lemma 2 to our system, it is sufficient to focus on the region
of the chain on which the dependencies have been generated after the time $t$
under consideration. It is given by

$$S^{0,\dots,t} := S^{(0)} \times S^{(1)} \times \cdots \times S^{(t)} . \tag{8}$$

Its state is given by the variable transformation

$$(Z_t^{(0)}, \dots, Z_t^{(t)}) = (\Gamma_{0,1} \circ \Gamma_{1,2} \circ \cdots \circ \Gamma_{t-1,t})(Z_0^{(-t)}, \dots, Z_0^{(0)}) \tag{9}$$

and all the other sites are still jointly independent and independent of the
region (8). If the relation between $X_t$ and $X_{t+1}$ is non-trivial (i.e.,
neither deterministic nor independent), $\Gamma$ must have both diagonal and
off-diagonal non-zero entries, which implies that the state of
$S^{0,\dots,t}$ is not a product state.
The following argument shows that the dependencies between the outgoing
particles are closely linked to the irreversibility of the scenario: The
fact that the time evolution generates a causal AR(1)-process is ensured
by the independence of $Z_t^{(-1)}, Z_t^{(-2)}, Z_t^{(-3)}, \dots$ describing
the incoming particles. If the variables $Z_t^{(1)}, Z_t^{(2)}, \dots$ are
also independent, we can run the process backwards to induce the causal
AR(1)-process $(X_{-t})$. However, by virtue of Theorem 1, this is only
possible for $(X_t)$ Gaussian.

Summarizing the essential part of the argument, the joint distribution
$P_{X_t,X_{t+1}}$ has a linear model from $X_t$ to $X_{t+1}$ but not vice
versa because the incoming particles are jointly independent but the outgoing
particles are dependent. Now we show a quantitative relation between the
non-linearity in backward time direction and the generated dependencies. To
this end, we measure the strength of the statistical dependencies of the
joint system as follows. If a system consists of finitely many subsystems,
its multi-information is defined by

$$I(Y_1, \dots, Y_k) := \sum_{j=1}^{k} H(Y_j) - H(Y_1, \dots, Y_k) .$$

Here, $H(\cdot)$ is the differential Shannon entropy [17]

$$H(Y_1, \dots, Y_n) := - \int p(y_1, \dots, y_n) \log p(y_1, \dots, y_n) \, dy_1 \cdots dy_n ,$$

where $p(y_1, \dots, y_n)$ denotes the joint probability density of the
random variables $Y_1, \dots, Y_n$. For $k = 2$, the multi-information
coincides with the usual mutual information $I(Y_1 : Y_2)$.
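
As a worked special case (a standard closed form, added here only for orientation), assume that $Y_1, \dots, Y_k$ are jointly Gaussian with covariance matrix $\Sigma$. The multi-information then depends on $\Sigma$ alone:

```latex
\begin{align*}
H(Y_1,\dots,Y_k) &= \tfrac{1}{2}\log\bigl((2\pi e)^k \det\Sigma\bigr), \qquad
H(Y_j) = \tfrac{1}{2}\log\bigl(2\pi e\,\Sigma_{jj}\bigr),\\
I(Y_1,\dots,Y_k) &= \tfrac{1}{2}\log\frac{\prod_{j=1}^{k}\Sigma_{jj}}{\det\Sigma},\\
I(Y_1:Y_2) &= -\tfrac{1}{2}\log\bigl(1-\rho^2\bigr)
  \quad (k = 2,\ \rho \text{ the correlation coefficient of } Y_1 \text{ and } Y_2).
\end{align*}
```

In particular, the multi-information of Gaussian variables vanishes exactly when $\Sigma$ is diagonal.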

For our infinite system we define multi-information as follows: 

Definition 3 (multi-information) 

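The multi-information of the joint system of all $S^{(j)}$ at time $t$ is
defined by

$$I(t) := \lim_{m \to \infty} I_{-m,\dots,m}(t) ,$$

whenever the limit exists. Here $I_{-m,\dots,m}(t)$ denotes the
multi-information of the subsystems $S^{(-m)}, \dots, S^{(m)}$ at time $t$.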

Its increase in time can easily be computed: 
Lemma 3 (multi-information as pairwise information)

Let the initial state of $S^{-\infty,\dots,\infty}$ satisfy the conditions
(7). Then the multi-information generated by the process in eqs. (4) to (6)
with $\Gamma \in SL(2)$ satisfies:

$$I(t) - I(t-1) = I(Z_t^{(0)} : Z_t^{(1)}) \quad \forall t > 0 .$$

Proof: We consider the state of the system $S^{0,\dots,t}$ at time $t$ that
we would have obtained if the interaction had been inactive (i.e.,
$\Gamma = 1$) during the last time step. It is described by the transformed
variables

$$(\tilde{Z}_t^{(0)}, \dots, \tilde{Z}_t^{(t)}) := (\Gamma_{1,2} \circ \Gamma_{2,3} \circ \cdots \circ \Gamma_{t-1,t})(Z_0^{(-t)}, \dots, Z_0^{(0)}) . \tag{10}$$

Their multi-information coincides with $I(t-1)$ because the shift part of the
dynamics is irrelevant.

The true state of system $S^{0,\dots,t}$ at time $t$ is then given by
additionally applying $\Gamma_{0,1}$ to eq. (10). The increase of
multi-information caused by applying $\Gamma$ to system $S^{(0)}$ and
$S^{(1)}$ can be computed as follows. Clearly, the joint entropy of the
system $S^{0,\dots,t}$ remains constant. Hence the only change of
multi-information is due to the change of the marginal entropies of $S^{(0)}$
and $S^{(1)}$. Since $\Gamma_{0,1}$ also preserves the joint entropy of
system $S^{0,1}$, the increase of the marginal entropies coincides with the
pairwise mutual information created between $S^{(0)}$ and $S^{(1)}$. Hence,

$$I(t) - I(t-1) = I(Z_t^{(0)} : Z_t^{(1)}) ,$$

where we have used the fact that the state of all systems $S^{(j)}$ with
$j > 1$ is only shifted. □
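
For Gaussian inputs the per-step information of Lemma 3 has a closed form, which makes the lemma easy to probe numerically. The sketch below assumes that the two inputs of the interaction are independent Gaussians of equal variance; the function name `gaussian_step_information` and the two example matrices (a rotation and an upper-triangular matrix with the same first row) are illustrative choices, not part of the lemma.

```python
import math

def gaussian_step_information(gamma):
    """Mutual information (in nats) between the two outputs of gamma applied
    to two independent Gaussian inputs of equal variance."""
    (g11, g12), (g21, g22) = gamma
    # The output covariance is proportional to gamma * gamma^T.
    v0 = g11 * g11 + g12 * g12
    v1 = g21 * g21 + g22 * g22
    c01 = g11 * g21 + g12 * g22
    rho2 = c01 * c01 / (v0 * v1)            # squared correlation of the outputs
    return -0.5 * math.log(1.0 - rho2)      # bivariate Gaussian mutual information

a = 0.7                                     # arbitrary illustrative angle
rotation = ((math.cos(a), math.sin(a)),
            (-math.sin(a), math.cos(a)))
triangular = ((math.cos(a), math.sin(a)),
              (0.0, 1.0 / math.cos(a)))     # determinant 1, same first row

info_rotation = gaussian_step_information(rotation)
info_triangular = gaussian_step_information(triangular)
```

Both matrices have the same first row and therefore induce the same recursion for the system variable, but only the rotation leaves the outgoing particle independent of the system; for the triangular matrix the information per step equals $-\log\cos a$.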

To show the link between the amount of generated dependencies and the 
non-linearity of the backward process, we measure the latter as follows. 

Definition 4 (measuring non-linearity of joint distributions) 

Let $L$ be the set of joint distributions $R_{X,Y}$ that admit a linear model
from $X$ to $Y$. Set

$$D(P_{X,Y} \| L) := \inf_{R_{X,Y} \in L} D(P_{X,Y} \| R_{X,Y}) ,$$

where $D$ denotes the relative entropy distance [17] and the infimum is taken
over all distributions in $L$.

Then we have: 
Theorem 2 (non-linearity of backwards model and multi-information)

Let $(X_t)$ be a causal AR(1)-process and $I(t)$ the multi-information of all
the "particles" in the toy model given by eqs. (4) to (6). Then,

$$I(t) - I(t-1) \geq D(P_{X_t,X_{t-1}} \| L) .$$

Proof: Assume $X_t$ and $X_{t-1}$ are neither linearly dependent nor
statistically independent, because otherwise the bound becomes trivial since
we would have $P_{X_t,X_{t-1}} \in L$. The idea of the proof is the
following: we figure out how much the joint distribution of $X_t$ and
$X_{t-1}$ has to be modified to admit a linear model from $X_t$ to
$X_{t-1}$. We have already argued that the entire stochastic process would
admit a linear model in backward direction if all the outgoing particles were
statistically independent. To obtain a linear model only from $X_t$ to
$X_{t-1}$ by reversing the physical toy model, it is sufficient to replace
$S^{(1)}$ at time $t$ with a system that is independent of the remaining
ones. More precisely, we replace the joint distribution $P$ of all
$Z_t^{(j)}$ by the unique distribution $\tilde{P}$ for which $Z_t^{(1)}$ and
the remaining variables are independent, but whose marginal distributions for
$Z_t^{(1)}$ and for the rest coincide with those of $P$, i.e.,

$$\tilde{P} := P_{Z_t^{(1)}} \otimes P_{\dots, Z_t^{(-1)}, Z_t^{(0)}, Z_t^{(2)}, Z_t^{(3)}, \dots} .$$

Then we check how this changes the joint distribution of $X_t$ and
$X_{t-1}$. The inverse dynamics $t \mapsto t-1$ is given by

$$Z_{t-1}^{(0)} = \tilde{\gamma}_{11} Z_t^{(0)} + \tilde{\gamma}_{12} Z_t^{(1)} \tag{11}$$

$$Z_{t-1}^{(-1)} = \tilde{\gamma}_{21} Z_t^{(0)} + \tilde{\gamma}_{22} Z_t^{(1)} \tag{12}$$

$$Z_{t-1}^{(j)} = Z_t^{(j+1)} \quad \text{for } j \neq 0, -1 , \tag{13}$$

where $\tilde{\gamma}_{ij}$ denote the entries of $\Gamma^{-1}$.
Since $X_t = Z_t^{(0)}$ and

$$X_{t-1} = \tilde{\gamma}_{11} Z_t^{(0)} + \tilde{\gamma}_{12} Z_t^{(1)} , \tag{14}$$

which is implied by eq. (11), the pairs $(Z_t^{(0)}, Z_t^{(1)})$ and
$(X_t, X_{t-1})$ span the same probability space (note that both coefficients
in eq. (14) are non-zero because we have excluded the cases of linear
dependency and statistical independence). Hence
$\tilde{P}_{Z_t^{(0)},Z_t^{(1)}}$ induces by variable transformation a
distribution $\tilde{P}_{X_t,X_{t-1}}$ satisfying

$$D(P_{X_t,X_{t-1}} \| \tilde{P}_{X_t,X_{t-1}}) = D(P_{Z_t^{(0)},Z_t^{(1)}} \| \tilde{P}_{Z_t^{(0)},Z_t^{(1)}}) .$$

The left hand side is an upper bound for the distance of $P_{X_t,X_{t-1}}$ to
a linear model from $X_t$ to $X_{t-1}$ because $\tilde{P}_{X_t,X_{t-1}}$
admits such a model. The right hand side coincides with the mutual
information between $Z_t^{(1)}$ and $X_t = Z_t^{(0)}$ (since mutual
information is known to be the relative entropy distance to the product of
marginal distributions [17]), which is exactly the multi-information
generated in step $(t-1) \to t$ due to Lemma 3. □
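
The proof can be condensed into one chain of relations. This is only a restatement of the steps above in compact form; it uses that, by construction, the modified distribution of the pair $(Z_t^{(0)}, Z_t^{(1)})$ is the product of the original marginals.

```latex
\begin{align*}
D\bigl(P_{X_t,X_{t-1}} \,\|\, L\bigr)
  &\le D\bigl(P_{X_t,X_{t-1}} \,\|\, \tilde{P}_{X_t,X_{t-1}}\bigr)
     && \text{since } \tilde{P}_{X_t,X_{t-1}} \in L\\
  &= D\bigl(P_{Z_t^{(0)},Z_t^{(1)}} \,\|\, P_{Z_t^{(0)}} \otimes P_{Z_t^{(1)}}\bigr)
     && \text{invertible change of variables}\\
  &= I\bigl(Z_t^{(0)} : Z_t^{(1)}\bigr)
     && \text{definition of mutual information}\\
  &= I(t) - I(t-1)
     && \text{Lemma 3.}
\end{align*}
```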

If $X_t$ is Gaussian, the stochastic process can be obtained without
generation of multi-information: If $C$ denotes the covariance matrix of the
pair $(X_t, Z_t^{(-1)})$, which is diagonal by assumption (because the
variables are independent and identically distributed), then the generation
of multi-information is zero if and only if the covariance matrix
$\Gamma C \Gamma^T$ of the outgoing pair is diagonal. The easiest case is
that $\Gamma$ rotates the space by some angle $\alpha$. Even though this
dynamics leaves the entire joint state of the system invariant, it can induce
any stationary AR(1)-process. This is because $|\phi| < 1$ in eq. (3) for
such a process, so that we can set $\phi = \cos\alpha$ and write

$$X_{t+1} = \cos\alpha \, X_t + \epsilon_t \quad \text{with} \quad \epsilon_t := \sin\alpha \, Z_t^{(-1)} .$$

Note that Gaussian processes can also be realized by a system that does
generate multi-information. For instance,

$$\Gamma := \begin{pmatrix} \cos\alpha & \sin\alpha \\ 0 & \cos^{-1}\alpha \end{pmatrix}$$

induces the same process $(X_t)$ as a rotation by the angle $\alpha$, but
induces dependent outgoing particles because $\Gamma \Gamma^T$ is
non-diagonal. This shows that the correspondence between entropy production
and time-inversion asymmetry of $(X_t)$ can only consist of lower bounds.

3 Interpretation

We first discuss the interpretation of the Gaussian case. To show an even
closer link to thermodynamics, we recall that Gaussian distributions often
occur in the context of thermal equilibrium states. For instance, the
position and momentum of a harmonic oscillator are Gaussian distributed in
thermal equilibrium. Hence we interpret the case of the isotropic Gaussian as
thermal equilibrium dynamics. The fact that the joint distribution
$P_{X_t,X_{t+1}}$ coincides with $P_{X_t,X_{t-1}}$ is exactly the symmetry
imposed by the well-known detailed-balance condition [18] that holds for
every Gibbs state.

In order to interpret the scenario in the non-Gaussian case as entropy
production, we note that the sum over the marginal entropies of the
subsystems increases linearly in time. The fact that the joint Shannon
entropy remains constant increasingly loses its practical relevance, since
complex joint operations would be required to undo the dependencies. From a
coarse-grained point of view, the entropy increases in every step.

In our experiments we found several examples of time series that could
be fit better with a causal ARMA model from the future to the past than
vice versa, even though this was only a minority of those for which a
decision was made. Of course, this does not contradict the second law. To
avoid such false conclusions, we discuss which assumptions could be violated
to generate time series that admit non-Gaussian ARMA models in the wrong
direction.

To this end, we list the requirements which jointly make the time-inverted
scenario of the above dynamics extremely unlikely:

1. The "incoming particles" (which correspond to the outgoing ones in
the original scenario) and $S^{(0)}$ would have to be statistically
dependent.^3

2. The coupling between $S^{(0)}$ and the incoming particles must be chosen
such that it exactly removes the incoming dependencies. There is nothing
wrong with dependent particles approaching $S^{(0)}$ and a coupling
that destroys dependencies between the particles and $S^{(0)}$ by creating
additional dependencies with a third party. However, removing dependencies
in a closed system requires transformations that are specifically
adapted to the kind of dependencies that are present. In other words,
the coupling between $S^{(0)}$ and the incoming particles would have to be
one of the "few" linear maps $\Gamma \in SL(2)$ needed for undoing the
operation that created the incoming dependencies.

We want to be more explicit about the last item and recall that the joint
state of $S^{0,\dots,t}$ at time $t$ is given by eq. (9).

We now apply the time-inverted dynamics (11)-(12) (starting from $t$ and
ending at $0$) to this input using some arbitrary
$\tilde{\Gamma} \in SL(2)$. The state of $S^{0,\dots,t}$ then

[...]

where we have defined

$$\hat{\Gamma} := \tilde{\Gamma} \circ \Gamma .$$

Due to Lemma 2, this can only be a product state if $\hat{\Gamma}$ has only
diagonal or only off-diagonal entries (or if $Q$ is Gaussian). This shows
that the dependencies can only be resolved by $\tilde{\Gamma}$ if it is
adjusted to the specific form of the dependencies of the incoming particles.

^3 This indicates that they have already been interacting earlier, cf.
Reichenbach's principle of the common cause [1], which is meanwhile one of
the cornerstones of causal inference.

This kind of mutual adjustment between mechanism and incoming state
is unlikely. Similar arguments have been used in causal inference recently
[19, 20]. According to the language used there, the incoming state and the
coupling share algorithmic information, which indicates that the incoming
state and the coupling have not been chosen independently.^4

To generate a process $(X_t)_{t \in \mathbb{Z}}$ that admits a linear model
in backward direction thus requires a different class of dynamical models.
For instance, the joint dynamics could be non-linear.



4 Conclusions and discussion 

We have discussed time series that admit a causal ARMA model in forward
direction but require non-linear transitions in backward direction to remain
causal. Since previous experiments verified that some empirical time series
indeed show this asymmetry, we have presented a model that relates it to
the thermodynamic arrow of time.

To this end, we have presented a toy model of a physical system coupled
to an infinite environment where we linked the asymmetry to the
thermodynamic entropy production.

The essential point is that the linearity of the joint dynamics is inherited
by the forward but not by the backward conditionals. Of course, not every
physical dynamics is linear. Nevertheless, the result suggests that
simplicity of the laws of nature is inherited only by the forward-time
conditionals. Since stochastic processes usually describe the state of a
system that strongly interacts with its environment, there is no simple
entropy criterion to distinguish between the true and the wrong time
direction. Hence, more subtle asymmetries like the ones described here are
required.

^4 Note that the thermodynamic relevance of algorithmic information has also
been pointed out in [21].

The asymmetries are consistent with observations in [22] discussing
interacting physical models of a causal relation between two random variables
$X$ (cause) and $Y$ (effect), where $P(Y|X)$ was simple and $P(X|Y)$ complex,
which has been used in recent causal inference methods [23, 24]. It should be
emphasized that this kind of reasoning cannot be justified by referring to
Occam's Razor only, i.e., the principle to prefer simple models if possible.
The point that deserves our attention is to justify that Occam's Razor should
be applied to causal conditionals P(effect|cause) instead of non-causal
conditionals like P(cause|effect). Studying these asymmetries for time series
highlights the relation to commonly accepted asymmetries between past and
future.

Acknowledgements: This work has been inspired by discussions with Ar- 
men Allahverdyan in a meeting that was part of the VW-project "Quantum 
thermodynamics: energy and information flow at the nanoscale".